What

makes computational communication science (ir)reproducible?

Johannes B. Gruber, Tim Schatto-Eckrodt, Chung-hong Chan

Universiteit van Amsterdam, Universität Hamburg, GESIS

2023-10-05

Computational Communication Science

the endeavor to understand human communication by developing and applying digital tools that often involve a high degree of automation in observational, theoretical, and experimental research.

Hilbert et al. (2019)

Computational Social Science

Adamic & Glance (2005)

Communication Science

Whatever piece of scholarship finds room in communication associations, journals, departments, and schools is communication research.

Waisbord (2019) - Communication: A Post-Discipline

Computational Communication Research (I/II)

Computational Communication Research (II/II)

CCR demands transparent and reproducible research. … As digital data and analysis code can be shared easily, computational research can be at the forefront of the open science philosophy … Most articles in CCR should be accompanied by an online appendix in a form that encourages reproducibility and reusability.

Van Atteveldt et al. (2019)

Question

Really?

“Subjects” and method

  • “Snapshot” date: 2023-05-25
  • All articles with empirical analyses published in CCR up to this date: 30
  • Preregistered the criteria for reproducibility
  • Code execution (+/- code rewrite)
  • “Postmortem analysis”

Computational reproducibility

A result is reproducible when the same analysis (1) steps performed on the same dataset (2) consistently produce the same answer (3). This reproducibility of the result can be checked (4) by the original investigators and other researchers within a local computational environment.

Schoch, Chan, Wagner, & Bleier. (2023)

Operationalization: Preregistered criteria for failure

  1. No shared code (Not the same analysis)
  2. No shared data (Not the same dataset)
  3. Code execution failure, despite code rewrite (Uncheckability for consistency)
  4. Technically executable, but results with major deviations (Not the same answer)

Let’s make a guess now

CCR papers are computationally reproducible.

? / 30

Code and Data archive

  • Archive data and code from OSF on the Snapshot date
  • Github: Clone the repo up to the last commit before the Snapshot date

Code execution environment

  • Docker container
  • Rocker rver + an optional Python pyenv layer
  • R 4.3.0 and Python 3.11.3 (Snapshot date)

Dependencies (I/II)

  • Code scanning for dependencies: renv (R) and pipreqs (Python)
  • Installation: Posit Public Package Manager (set to Snapshot date)

Dependencies (II/II)

docker-compose.yml

version: "3.8"
services:
  repro_4-2:
    build: .
    network_mode: "host"
    environment:
      - SNAPSHOT_DATE=2023-05-25
      - PYTHON_VERSION=3.11.3
    volumes:
       - ./output:/usr/local/src/output # [local path]:[container path]

install_dependencies.R

snap_date <- as.Date(Sys.getenv("SNAPSHOT_DATE"))
repo <- paste0("https://packagemanager.posit.co/cran/", snap_date)
options(repos = c(REPO_NAME = repo))

install_dependencies.sh

pip config set global.index-url https://packagemanager.posit.co/pypi/$SNAPSHOT_DATE/simple

Code execution

  • Allow rewrite
  • Patch the code by either sed or patch
  • Batch execution via a single bash script inside the container
git init
git remote add origin https://github.com/xxx/yyy
git fetch
git checkout ed5bd12a590b06e9a32058fe2eec57f38cd3f1e2

Rscript install_dependencies.r

unzip Docs.zip

mkdir Data ## undocumented

## Code execution
Rscript 01_data_processing.R
bash 02_glove.sh
Rscript 03_dictionary_generation.R
##04 hardcoded the plan; and multisession is dep
sed -i 's/multiprocess/multicore/' 04_sentiment_validation.R
Rscript 04_sentiment_validation.R
Rscript 05_sentiment_scoring.R

cp -r Data /usr/local/src/output

Postmortem analysis

  • Minor deviations:

deviations in the decimals or obvious typographical errors

Crüwell et al. (2023)

  • Major deviations: deviations beyond minor

Qualitative Analysis

  • Document what makes computational communication science (ir)reproducible.

Results

Criteria 1 and 2

  • 13 no shared code (12 also no shared data)
  • 3 shared code but no data

Current result

14 / 30

A B C D E F G H I J K L M N

Articles with code and data

B E G H J M N

Articles which the provided code can be executed

B E H J M N

Articles which produce the same answer

Final answer

6 / 30

B

E

H

J

M

N

B (Major code rewrite, missing data for an analysis in the appendix)

E (Minor code rewrite)

H (Major code rewrite)

J (Major code rewrite)

M (No code rewrite, run for a month)

N (Minor code rewrite, missing data for an analysis in the appendix)

B (Major code rewrite, missing data for an analysis in the appendix)

E (Minor code rewrite)

H (Major code rewrite)

J (Major code rewrite)

M (No code rewrite, run for a month)

N (Minor code rewrite, missing data for an analysis in the appendix)

Irreproducible

  • Incomplete sharing of data
  • Social media data antics
  • Incomplete sharing of code
  • Code quality issue
  • Insufficient documentation
  • Undocumented computational environment

Reproducible

  • Research compendium
    • Documentation, computational environment etc.
  • Sequential execution (and check in batch)
    • 00_preprocess.R, 01_validation.R
    • Makefile, dodo
    • R CMD BATCH, jupyter nbconvert --execute
  • Literate programming
  • Reduce external dependencies

Recommendations for the subfield

  • доверяй, но проверяй
  • Incentivize data and code sharing
  • Protected access to sensitive data
  • Standardize the computational environment

Summary

  • 6 / 30 or 1 / 30
  • 14: not due to technical issues
  • Incomplete sharing is the major technical issue (not computational environment)
  • Improve research practices